Database model for hierarchical data formats

ABSTRACT

The present invention relates to a method for mapping a hierarchical data format to a relational database management system. It is an object of the invention to provide a method for mapping a hierarchical data format comprising descriptors to a relational database management system capable of handling diverse types of hierarchical descriptors in a fast manner for inserting descriptors, reading parts of descriptors, reading whole descriptors and performing fast text queries. According to the invention, the descriptors are separated into portions of a common format, which are stored in a relation in the relational database.

The present invention relates to a method for mapping a hierarchical data format to a relational database management system. Furthermore, the present invention relates to a database model and an apparatus for reading from and/or writing to recording media using such a method.

The future of digital recording will be characterised by the preparation, presentation and archiving of added value data services, i.e. a recorder, like a DVR (Digital Video Recorder) for example, will store and handle additional information delivered by content providers like broadcasters or special services or even assembled by the user himself. Added value (metadata) is generated to give further information to the user. For example, added value may be a movie summary explaining the story, a listing of the actors etc. Also the provision of additional information facilitating navigation inside the movie constitutes added value. For example, a movie can be structured into sections, subsections etc. each having an individual title and possibly comprising further useful information.

For providing structural information and for transporting other metadata for multimedia objects like video or audio streams, a hierarchical data format is generally used. A well-known and widely accepted hierarchical data format is the Extensible Markup Language (XML). XML is a system for defining specialized markup languages that are used for transmitting formatted data. It is, therefore, also called a meta language, a language used for creating other specialized languages. XML data consists of text, which is organised in the form of a plurality of descriptors. The text itself contains elements, attributes and content, i.e. the remaining text. Besides the use for multimedia objects, many other applications for XML are known.

It is to be expected that in the foreseeable future digital recorders will store quite a large amount of data in XML or another hierarchical data format in relational databases, since these databases are widely used and quite sophisticated. However, there is the problem that for storage the hierarchical data format has to be mapped to a relational database management system (RDBMS). A number of database models for XML have already been proposed. See for example Rahayu et al.: Representation of multilevel composite objects in relational databases. OOIS'98, Proceedings of the 1998 International Conference on Object Oriented Information Systems, pp. 221-238, or Zhang et al.: On Supporting Containment Queries in Relational Database Management Systems, ACM SIGMOD Record, vol. 30, no. 2 (2001), pp. 425-36. However, no known database model is capable of handling diverse types of hierarchical descriptors in a fast manner for inserting descriptors, reading parts of descriptors, reading whole descriptors and performing fast text queries.

It is, therefore, an object of the invention to provide a method for mapping a hierarchical data format comprising descriptors to a relational database management system. It is another object of the invention to provide a database model and an apparatus for reading from and/or writing to recording media using such a method.

According to the invention, the descriptors are separated into portions of a common format, which are stored in a relation in the relational database. The method has the advantage that it is independent of the structure of the stored descriptors. Only a restricted number of common formats is required for storing all types of descriptor formats. The common formats comprise, for example, elements, attributes, text etc. In this way each descriptor is analysed word by word, separated into its different components, and stored in the relation, which preferably is a table.

The method can be further improved by providing independent relations for the common formats. Every query uses only these relations. For example, a first relation contains only text, while a second relation contains elements etc. This enables fast and simple queries due to the restricted number of relations. If, for example, a text query has to be performed, only the relation containing text has to be searched. While it is advantageous to provide independent relations for all common formats, it is likewise possible to use a relation for more than one common format. For example, elements and attributes can be stored together in a first relation, while text is stored in a second relation.

According to a refinement of the invention, the method further comprises the step of storing information allowing recovery of the descriptor structure in the relations. When a query delivers only a single database entry, the complete structure of the descriptor belonging to the specific database entry can be recovered.

Advantageously, the information allowing recovery of the descriptor structure comprises descriptor numbers and relative and/or absolute positions of the portions of a common format within the descriptors. Using this information it is possible to collect the appropriate values from the database and to sort these values in a useful manner. Every time a descriptor is stored in the database, it receives a univocal descriptor number. In addition, for every portion of a common format of the descriptor the relative position within the descriptor and/or the absolute position within the relation is derived. The descriptor numbers and the relative and/or absolute positions are stored in the relations together with the portions of a common format.

Favourably, the information allowing recovery of the descriptor structure further comprises an indicator for the next upper hierarchical level of the portions of the common format within the descriptors. This facilitates a fast reconstruction of descriptor parts by going from an arbitrary part of the descriptor back (level oriented) to the head of the descriptor. The next upper hierarchical level is helpful information for reconstructing a descriptor part when only the relative or absolute word position of a portion of a common format is known, for example as a query result.

According to another aspect of the invention, the method further comprises the step of storing a descriptor index in the relational database. Such a descriptor index allows additional information to be stored for every descriptor and a specific descriptor to be found easily in the database.

Advantageously, the descriptor index comprises at least descriptor numbers, absolute positions of the descriptors within the relations and/or unique identifiers for the descriptors. Storing this information in the descriptor index allows fast access to a specific descriptor in the relations. The absolute position of a descriptor within the relations is favourably defined as the absolute position of its first portion of a common format. Since the unique identifiers are often needed, a faster access to this kind of data is provided by storing the unique identifier in the descriptor index. In addition to the mentioned information, other types of information can be stored in the descriptor index, like for example the number of levels of a descriptor or other useful data.

Favourably, the hierarchical data format comprising descriptors corresponds to the Extensible Markup Language. Since XML is widely used and well accepted, this allows a wide range of applications of the inventive method.

According to the invention, the common formats comprise at least elements, attributes and text. These types of common formats are sufficient for many applications. While the elements are mainly used for structuring the descriptors, the text contains the information which is in general searched in a query. Attributes are mostly used for characterising elements.

Favourably, the common format text is further divided into string values and integer values. In this way faster searches can be achieved, since the relations which have to be searched for a query become smaller. A query for a string value, for example, is performed in the relation containing only string values, which contains fewer entries than a relation containing both string and integer values.

Advantageously, the common formats further comprise namespace information. This feature is especially interesting for XML and makes it possible to prevent collisions between different documents when markup intended for one document uses the same element types or attribute names as another document for different purposes.

Favourably, a database model for mapping a hierarchical data format comprising descriptors to a relational database management system uses a method according to the invention. Such a database model is capable of realizing simple and fast queries, flexible handling of diverse descriptor formats, simple and fast reconstruction of descriptors, and simple and fast insertion of descriptors. In addition, such a database model can easily be implemented with existing relational database management systems.

Advantageously, an apparatus for reading from and/or writing to recording media uses a method or a database model according to the invention for mapping a hierarchical data format comprising descriptors to a relational database management system. Such an apparatus allows added value information to be stored in an existing relational database. A user of the apparatus can easily use and/or edit the added value information.

For a better understanding of the invention, exemplary embodiments are specified in the following description of advantageous embodiments with reference to the figures, using XML as an example for a hierarchical data format. It is understood that the invention is not limited to these exemplary embodiments and that specified features can also expediently be combined and/or modified without departing from the scope of the present invention. In the figures:

FIG. 1 a, 1 b show a simplified XML descriptor and its representation as an XML tree,

FIG. 2 shows a database model according to the invention using a single relation,

FIG. 3 shows a database model as in FIG. 2, but wherein additional information on the descriptor structure is stored,

FIG. 4 a, 4 b show the representations of an XML descriptor as in FIG. 1, but wherein the text comprises string values and integer values,

FIG. 5 shows a database model similar to FIG. 2, but wherein elements, attributes, integer values, and string values are separated into different relations,

FIG. 6 shows a database model similar to FIG. 3, but wherein repetitions inside the relations are eliminated by providing additional relations,

FIG. 7 shows a typical metadata descriptor comprising namespace information, a unique identifier and links to other metadata descriptors,

FIG. 8 shows a metadata stream comprising a plurality of metadata descriptors, and

FIG. 9 shows a database model according to FIG. 6 comprising a descriptor index.

FIG. 1 shows in part a) a simplified example of an XML descriptor 10 and in part b) the corresponding representation as an XML tree. As can be seen from the figure, the exemplary descriptor 10 comprises a section, a subsection and a sub-subsection, each having a title. The title of the sub-subsection has an attribute “arrow” with the value “down”. The descriptor 10 consists of a total of 17 words, wherein the text of each title counts as a single word, independent of the actual number of words. For example, “Leonardo is swimming” is a single “logical” word, though it comprises three “actual” words. The number given in each line of the descriptor 10 in part a) of the figure is the relative word position of the first word of each line within the descriptor 10. From the corresponding tree structure in part b) of the figure, it can be seen that the descriptor 10 has five levels, namely level 0 to level 4. The tree structure is a helpful tool for illustrating the hierarchical relations between the different words of the descriptor 10.

In FIG. 2 a database model according to the invention is shown, wherein a single relation 20 is used. The relation 20 is represented by a table. The first column “Value” indicates the stored portion itself (the XML string). The second column “Descr#” indicates the univocal descriptor number inside the database management system. The column “Word Pos.” contains the relative position of the stored part within the specific descriptor 10. “Descr#” and “Word Pos.” taken together are a primary key of the relation 20, allowing the complete recovery of a descriptor 10. The type of each XML string is also stored in the relation in the column “Type”. In the example, the types comprise “element”, “attribute” and “text”. The last column “Level” contains the hierarchical level of each XML string as shown in FIG. 1 b. As can be seen, not all words of the descriptor 10 are stored in the relation 20. The “closing” words like </title> and </section> do not contain additional information and are not necessarily needed for recovery of the descriptor 10. They are, therefore, not stored in the database. It is, of course, possible to also store these words if necessary.
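A single relation of this kind may, for example, be sketched with SQLite as follows. The table name `descriptor_words`, the column names and the sample rows are assumptions chosen for illustration; only the column meanings and the composite primary key follow the description above.

```python
import sqlite3


def create_single_relation(conn):
    # One table for all portions; (descr_no, word_pos) is the primary key.
    conn.execute(
        """
        CREATE TABLE descriptor_words (
            value    TEXT    NOT NULL,
            descr_no INTEGER NOT NULL,
            word_pos INTEGER NOT NULL,
            type     TEXT    NOT NULL
                     CHECK (type IN ('element', 'attribute', 'text')),
            level    INTEGER NOT NULL,
            PRIMARY KEY (descr_no, word_pos)
        )
        """
    )


def read_descriptor(conn, descr_no):
    # Ordering by word position restores the original sequence of portions.
    return conn.execute(
        "SELECT word_pos, type, value, level FROM descriptor_words "
        "WHERE descr_no = ? ORDER BY word_pos",
        (descr_no,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
create_single_relation(conn)
conn.executemany(
    "INSERT INTO descriptor_words VALUES (?, ?, ?, ?, ?)",
    [
        ("title", 1, 2, "element", 1),  # inserted out of order on purpose
        ("section", 1, 1, "element", 0),
        ("Leonardo is swimming", 1, 3, "text", 2),
    ],
)
print(read_descriptor(conn, 1))
```

Closing tags are not inserted in this sketch, in line with the embodiment, so gaps in the word positions are expected.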

FIG. 3 shows a database model similar to FIG. 2, but wherein an additional column “Next Upper Word Pos.” is included in the relation 21, which contains an indicator for the next upper hierarchical word of the XML string within the specific descriptor 10. This is helpful information for recovering a descriptor part when only the word position of a portion of a common format is known, for example as a query result. A fast reconstruction of descriptor parts is facilitated by providing this additional information.
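The use of the “Next Upper Word Pos.” indicator may be illustrated by the following Python sketch, which walks from an arbitrary word back to the head of the descriptor. The dictionary layout, the function name `path_to_head` and the sample word positions are assumptions for this sketch; the head of the descriptor is assumed to carry no upper word (`None`).

```python
def path_to_head(rows, word_pos):
    """Return the values from the head of the descriptor down to word_pos.

    rows maps word_pos -> (value, next_upper_word_pos).
    """
    path = []
    current = word_pos
    while current is not None:
        value, upper = rows[current]
        path.append(value)
        current = upper  # one step up in the hierarchy
    return path[::-1]


# Hypothetical descriptor: word position -> (value, next upper word position)
ROWS = {
    1: ("section", None),
    2: ("title", 1),
    3: ("Intro", 2),
    5: ("subsection", 1),
    6: ("title", 5),
    8: ("Leonardo is swimming", 6),
}

print(path_to_head(ROWS, 8))
```

Only one lookup per hierarchical level is needed, so the cost depends on the depth of the word rather than on the size of the descriptor.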

In FIG. 4 another simplified descriptor 11 similar to the one in FIG. 1 is shown. However, in this example the text consists of string values and integer values. As can be seen from part b) of the figure, string values and integer values are separated and count as distinct “logical” words.
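One conceivable way of separating text into string values and integer values is sketched below in Python. The splitting rule (whitespace-separated tokens, consecutive non-numeric tokens merged into one string value, each numeric token forming its own integer value) is an assumption made for this sketch; the description does not prescribe a particular rule.

```python
from itertools import groupby


def split_text(text):
    """Split text into ("string", str) and ("integer", int) logical words."""
    parts = []
    # groupby collects runs of adjacent tokens that are all decimal or all not
    for is_int, group in groupby(text.split(), key=str.isdecimal):
        if is_int:
            parts.extend(("integer", int(token)) for token in group)
        else:
            parts.append(("string", " ".join(group)))
    return parts


print(split_text("Leonardo is swimming 1452"))
```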

FIG. 5 depicts a database model similar to the one shown in FIG. 2. However, in this example the XML strings are separated into elements, attributes, string values and integer values, and stored in different relations 22, 23, 24, 25. This allows for faster searches inside the relations 22, 23, 24, 25. Due to the descriptor number and the word position it is still possible to recover the complete descriptor 11 from the different relations 22, 23, 24, 25. A value “Type” is not necessary in this embodiment, since every relation 22, 23, 24, 25 contains only a specific type.
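The separation into one relation per type may be sketched with SQLite as follows. The table names, column names and helper functions are assumptions for illustration. The sketch shows that a text query touches only the string relation, while the descriptor number and word position still allow the complete descriptor to be recovered across all relations.

```python
import sqlite3

# kind -> (table name, SQL type of the value column); names are hypothetical
RELATIONS = {
    "element": ("elements", "TEXT"),
    "attribute": ("attributes", "TEXT"),
    "string": ("string_values", "TEXT"),
    "integer": ("integer_values", "INTEGER"),
}


def create_relations(conn):
    for table, sql_type in RELATIONS.values():
        conn.execute(
            f"CREATE TABLE {table} ("
            f" value {sql_type} NOT NULL,"
            " descr_no INTEGER NOT NULL,"
            " word_pos INTEGER NOT NULL,"
            " level INTEGER NOT NULL,"
            " PRIMARY KEY (descr_no, word_pos))"
        )


def store(conn, kind, value, descr_no, word_pos, level):
    table = RELATIONS[kind][0]
    conn.execute(
        f"INSERT INTO {table} VALUES (?, ?, ?, ?)",
        (value, descr_no, word_pos, level),
    )


def find_string(conn, needle):
    # A text query searches the string relation only.
    return conn.execute(
        "SELECT descr_no, word_pos FROM string_values WHERE value LIKE ?",
        (f"%{needle}%",),
    ).fetchall()


def read_descriptor(conn, descr_no):
    # Merge all relations and sort by word position to recover the descriptor.
    union = " UNION ALL ".join(
        f"SELECT word_pos, value FROM {table} WHERE descr_no = :d"
        for table, _ in RELATIONS.values()
    )
    return conn.execute(union + " ORDER BY word_pos", {"d": descr_no}).fetchall()


conn = sqlite3.connect(":memory:")
create_relations(conn)
store(conn, "element", "section", 1, 1, 0)
store(conn, "element", "title", 1, 2, 1)
store(conn, "string", "Leonardo is swimming", 1, 3, 2)
store(conn, "integer", 1452, 1, 4, 2)
print(find_string(conn, "swimming"))
print(read_descriptor(conn, 1))
```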

In FIG. 6 a further refinement of a database model according to the invention is shown. The database model is similar to the one shown in FIG. 3; however, repetitions inside the relation 31 are eliminated. This is achieved by providing additional relations 32, 33, 34, 35 (“secondary relations”) for the elements, string and integer values, and attributes. For each XML string a value “Type” and a corresponding descriptor key “Descr. Key” are included in the “primary” relation 31. The descriptor key indicates the corresponding entry in the additional relation 32, 33, 34, 35 for the specific type of XML string. The columns “Type” and “Descr. Key” taken together can be regarded as a secondary key, since they link each XML string specified by a primary key with the specific value.
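The elimination of repetitions by means of secondary relations may be sketched with SQLite as follows. Only two secondary relations are shown, and all table, column and function names are assumptions for this sketch. Each distinct value is stored once in its secondary relation, and the primary relation refers to it through the pair of type and descriptor key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE elements (
        descr_key INTEGER PRIMARY KEY,
        value     TEXT NOT NULL UNIQUE
    );
    CREATE TABLE string_values (
        descr_key INTEGER PRIMARY KEY,
        value     TEXT NOT NULL UNIQUE
    );
    CREATE TABLE primary_relation (
        descr_no  INTEGER NOT NULL,
        word_pos  INTEGER NOT NULL,
        level     INTEGER NOT NULL,
        type      TEXT    NOT NULL,
        descr_key INTEGER NOT NULL,
        PRIMARY KEY (descr_no, word_pos)
    );
    """
)

SECONDARY = {"element": "elements", "string": "string_values"}


def key_for(conn, kind, value):
    """Return the key of value in its secondary relation, adding it if new."""
    table = SECONDARY[kind]
    conn.execute(f"INSERT OR IGNORE INTO {table} (value) VALUES (?)", (value,))
    return conn.execute(
        f"SELECT descr_key FROM {table} WHERE value = ?", (value,)
    ).fetchone()[0]


def store(conn, descr_no, word_pos, level, kind, value):
    conn.execute(
        "INSERT INTO primary_relation VALUES (?, ?, ?, ?, ?)",
        (descr_no, word_pos, level, kind, key_for(conn, kind, value)),
    )


store(conn, 1, 1, 0, "element", "section")
store(conn, 1, 2, 1, "element", "title")
store(conn, 1, 3, 2, "string", "Intro")
store(conn, 1, 5, 1, "element", "subsection")
store(conn, 1, 6, 2, "element", "title")  # reuses the existing "title" entry
store(conn, 1, 7, 3, "string", "Leonardo is swimming")
```

In this sketch the element “title” occurs twice in the primary relation but only once in the secondary relation for elements.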

FIG. 7 shows a typical metadata descriptor 1. The actual content of the metadata descriptor is contained in the core 6. In addition, the metadata descriptor 1 comprises namespace declarations 2, a unique identifier 4 and links 5 to other metadata descriptors. The namespace declarations 2 and the unique identifier 4 are stored in special places inside the database management system since they are often needed. The intention is to provide a fast access to this kind of data. The namespace declarations 2 are only valid for the specific metadata descriptor 1. The unique identifier 4 allows an unambiguous identification of the metadata descriptor 1.

FIG. 8 depicts a metadata stream 7 comprising a plurality of metadata descriptors 1 like the one shown in FIG. 7. In addition, the metadata stream 7 comprises namespace declarations 2, which are valid for all metadata descriptors 1 inside the specific metadata stream 7.

In FIG. 9 the use of a descriptor index 40 is shown. The descriptor index 40 contains for each descriptor stored in the database the descriptor number, the number of levels of the descriptor (“Max Level”), its unique identifier (“UUID”) and its absolute position (“Abs. Pos.”) within the relation 41. The corresponding relation 41 is similar to the one shown in FIG. 6. However, it further comprises the absolute position of each XML string and the namespace declarations. The additional relations containing elements, string values, integer values and so on, which are addressed by the secondary key, are not shown for the sake of simplicity.
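A descriptor index of this kind may be sketched with SQLite as follows. For brevity the relation in this sketch stores the values directly instead of a type and a descriptor key, and namespace declarations are omitted. The table names, the function names, the placeholder identifiers and the interpretation of “Max Level” as the highest level occurring in the descriptor are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE words (
        abs_pos  INTEGER PRIMARY KEY,  -- absolute position within the relation
        descr_no INTEGER NOT NULL,
        word_pos INTEGER NOT NULL,
        value    TEXT    NOT NULL
    );
    CREATE TABLE descriptor_index (
        descr_no  INTEGER PRIMARY KEY,
        max_level INTEGER NOT NULL,
        uuid      TEXT    NOT NULL UNIQUE,
        abs_pos   INTEGER NOT NULL     -- position of the first portion
    );
    """
)


def insert_descriptor(conn, descr_no, uuid, words):
    """Store words, a list of (value, level) pairs, and index the descriptor."""
    first_abs_pos = None
    for word_pos, (value, _level) in enumerate(words, start=1):
        cur = conn.execute(
            "INSERT INTO words (descr_no, word_pos, value) VALUES (?, ?, ?)",
            (descr_no, word_pos, value),
        )
        if first_abs_pos is None:
            first_abs_pos = cur.lastrowid
    max_level = max(level for _, level in words)
    conn.execute(
        "INSERT INTO descriptor_index VALUES (?, ?, ?, ?)",
        (descr_no, max_level, uuid, first_abs_pos),
    )


def read_by_uuid(conn, uuid):
    """Find a descriptor through the index and read its values in order."""
    descr_no, abs_pos = conn.execute(
        "SELECT descr_no, abs_pos FROM descriptor_index WHERE uuid = ?", (uuid,)
    ).fetchone()
    rows = conn.execute(
        "SELECT value FROM words WHERE descr_no = ? AND abs_pos >= ? "
        "ORDER BY abs_pos",
        (descr_no, abs_pos),
    )
    return [value for (value,) in rows]


insert_descriptor(conn, 1, "uuid-a", [("section", 0), ("title", 1), ("Intro", 2)])
insert_descriptor(
    conn, 2, "uuid-b", [("section", 0), ("title", 1), ("Leonardo is swimming", 2)]
)
print(read_by_uuid(conn, "uuid-b"))
```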

The database models shown in the figures have a plurality of advantages, such as:

-   The flexibility to store all kinds of descriptors by providing a separation of the incoming XML stream into common formats.
-   Fast queries due to the restricted number of relations. For example, a text query has to be performed only in a small number of relations, like “string value” or “element”, i.e. only in such relations where strings are stored.
-   Fast implementation of such a database model into a database management system due to the restricted number of relations. Other database models need at least one relation for each descriptor type.
-   Fast recovery of descriptors back to XML format due to the specific modelling of the database, i.e. by using the attributes “Descr#” and “Word Pos.”.
-   Fast recovery of descriptor parts by providing the additional information “Next Upper Word Pos.”. It is helpful when going from an arbitrary part of the descriptor back (level oriented) to the head of the descriptor.

1. Method for mapping a hierarchical data format with descriptors to a relational database management system, including the steps of: separating the descriptors into portions of a plurality of common formats; storing the portions of the plurality of common formats in relations in the relational database; and storing information describing the descriptor structure in the relations together with the portions of the plurality of common formats; wherein the information describing the descriptor structure includes an indicator for the next upper hierarchical level of portions of a common format within the descriptors.
2. Method according to claim 1, wherein the information describing the descriptor structure includes descriptor numbers and relative and/or absolute positions of portions of a common format within the descriptors.
3. Method according to claim 1, further comprising the step of providing independent relations for the common formats.
4. Method according to claim 1, further comprising the step of storing a descriptor index in the relational database, the descriptor index allowing to store additional information for every descriptor.
5. Method according to claim 4, wherein the descriptor index comprises at least descriptor numbers, absolute positions of the descriptors within the relations and/or unique identifiers for the descriptors.
6. Method according to claim 1, wherein the hierarchical data format comprising descriptors corresponds to the Extensible Markup Language.
7. Method according to claim 1, wherein the common formats comprise at least elements, attributes and text.
8. Method according to claim 7, wherein the common format text is divided into string values and integer values.
9. Method according to claim 7, wherein the common formats further comprise namespace information.
10. Database model for mapping a hierarchical data format comprising descriptors to a relational database management system, wherein it uses a method according to claim 1.
11. Apparatus for reading from and/or writing to recording media, wherein it uses a method according to claim 1 or a database model for mapping a hierarchical data format comprising descriptors to a relational database management system.